This report outlines results on each of the six main tasks specified in the project. For convenience, the
deliverables specified in the project are included in slanted font and the report is given following each
deliverable. The deliverables are given here in the order most suited to the report, but numbered as in the
original project.

In  order  to  realize  a  general  method  to  identify  social  communities  residing  among  the  peripheral  end- 
user  nodes  of  a  network,  the  deliverables  of  the  proposed  research  are: 

3.  Develop  a  theory  for  conditional  reliability.  A  report  will  be  delivered  that  describes  a  model  for  the 
significance  of  the  relationships  through,  rather  than  to,  the  core  communities  involved  in  network 
operation.  The  relationships  will  be  inferred  from  common  patterns  of  interaction  with  the  highly 
interrelated  core  communities.  The  key  result  will  be  measures  of  the  strengths  of  relationships  among 
peripheral  nodes  when  they  are  much  more  closely  related  to  the  core  communities  than  to  each  other. 

and  4.  Develop  a  set  of  computational  methods  for  conditional  reliability.  A  report  describing  computational 
methods  that  identify  weaker  social  interactions  in  the  presence  of  very  strong  relationships  in  the 
physical  network  and  strong  ones  in  the  logical  network  will  be  delivered. 

These  have  resulted  in  the  development  of  reliability  measures  that  have  not  been  previously  studied. 
Before  commencing  this  project,  we  had  updated  a  survey  of  current  techniques  for  computation  of 
network  reliability  [13].  In  our  application,  however,  it  is  necessary  to  determine  that  a  community  of 
nodes is interconnected. The basic reliability question is to determine the probability that a specified
number of nodes is connected when some fixed nodes are to be included, and the remainder are to be
chosen from a subset of nodes. Working with a PhD student, Toni Farley, we have unified a broad
collection of measures of this type, and conducted a systematic literature review. A first version of
the literature review is given in [17], and an abbreviated version appears in [20]. Because of the
complexity  of  these  computations,  efficient  algorithms  for  sparse  networks  have  been  devised,  and 
are  reported  in  [19]. 

Some details about the new measures follow. The k-terminal reliability measures are natural in network
analysis, when one wants to know the probability that k specified nodes can communicate with
each other in a given network. In network design, however, while it may be known that k-terminal
operations will arise, it is unlikely that the identity of the specific nodes involved is known. These
considerations motivated the definition of (two-terminal) resilience [5], the average two-terminal
reliability over all choices $K \subseteq V$ with $|K| = 2$. This measures the expected ability of the network
to support a two-terminal operation, where the pair of terminals is selected uniformly at random. A
generalization to k-resilience [18] instead yields the average k-terminal reliability over all choices
$K \subseteq V$ with $|K| = k$.
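
The two measures can be illustrated by exhaustive enumeration on a very small network. The following is
a minimal sketch, assuming that all nodes operate and that edges fail independently; the function names
(component_of, k_terminal_reliability, k_resilience) are hypothetical and are not taken from [5] or [18].
The enumeration visits all 2^|E| edge states, so it is only practical for a handful of edges.

```python
from itertools import combinations, product


def component_of(nodes, edges, start):
    """Return the set of nodes reachable from `start` using only `edges`."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u] - seen:
            seen.add(w)
            stack.append(w)
    return seen


def k_terminal_reliability(nodes, edge_prob, targets):
    """Probability that all targets share one component (exhaustive over edge states)."""
    edges = list(edge_prob)
    targets = list(targets)
    total = 0.0
    for state in product([True, False], repeat=len(edges)):
        weight, up = 1.0, []
        for e, works in zip(edges, state):
            weight *= edge_prob[e] if works else 1.0 - edge_prob[e]
            if works:
                up.append(e)
        if set(targets) <= component_of(nodes, up, targets[0]):
            total += weight
    return total


def k_resilience(nodes, edge_prob, k):
    """Average k-terminal reliability over all k-subsets of the nodes."""
    subsets = list(combinations(nodes, k))
    return sum(k_terminal_reliability(nodes, edge_prob, K) for K in subsets) / len(subsets)


# Illustrative data only: a triangle whose edges each operate with probability 0.5.
triangle = {("a", "b"): 0.5, ("b", "c"): 0.5, ("a", "c"): 0.5}
print(k_terminal_reliability(["a", "b", "c"], triangle, ["a", "b"]))
print(k_resilience(["a", "b", "c"], triangle, 2))
```

On the triangle, the two-terminal reliability of any pair is 1 - (0.5)(0.75) = 0.625, and by symmetry the
two-terminal resilience has the same value. On a less symmetric network the two quantities differ, since
resilience averages over every choice of terminals.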

Both k-terminal reliability and k-resilience address the question: What is the probability that k nodes
can communicate? The difference is that for k-terminal reliability, the k communicating nodes are the
k targets chosen in advance, while for k-resilience, the k are chosen uniformly at random. There is
another natural way to interpret the question posed, when we are concerned with the existence of any
k nodes in a connected component. We formalize the differences among these three interpretations by
defining a common generalization of all three. Let $\mathcal{G} = (G, p, \rho)$ with $G = (V, E)$ be a probabilistic
graph, with operation probability $p_e$ for each edge $e \in E$ and $\rho_x$ for each node $x \in V$.
Let $H \subseteq V$, $|H| = h$; these are the target nodes. For a set $L \subseteq V$ and integer $j$, define
$\Psi(G, L, j)$ to be 1 if $G$ contains a connected component containing all vertices of $L$ and at least $j$
other vertices, 0 otherwise.

Define $\mathrm{Con}((G = (V, E), p, 1), H; h, i, j)$ to be

\[
\binom{|V| - h}{i}^{-1} \sum_{\substack{I \subseteq V \setminus H \\ |I| = i}} \; \sum_{F \subseteq E}
\left( \prod_{e \in F} p_e \prod_{e \in E \setminus F} (1 - p_e) \right) \Psi((V, F), H \cup I, j).
\]

An explanation in plain language is in order. When all nodes operate, this represents the expectation
that k = h + i + j nodes are connected, where the h nodes of H are selected in advance, the i nodes
of I are chosen uniformly at random among the remaining nodes, and the j nodes can then be any
remaining nodes.

Incorporating node failures in the definition is straightforward by considering the subgraph induced
on the operational nodes. For a set $W \subseteq V$, define $E_W = \{e \in E : |e \cap W| = 2\}$; in other words,
$E_W$ contains the edges of the subgraph $G_W$ induced on node set $W$. Then define

\[
\mathrm{Con}((G = (V, E), p, \rho), H; h, i, j) =
\sum_{H \subseteq X \subseteq V} \left( \mathrm{Con}((G_X, p, 1), H; h, i, j)
\prod_{x \in X} \rho_x \prod_{x \in V \setminus X} (1 - \rho_x) \right).
\]

The definition at first seems somewhat unwieldy. Nevertheless, taking h = k and i = j = 0, we
obtain k-terminal reliability. Taking i = k and h = j = 0, we obtain k-resilience. Taking j = k
and h = i = 0, we obtain the probability that $\mathcal{G}$ contains a component of size at least k, which we
term the kSet problem. The definition also permits the analysis of more involved questions, such as
determining the probability that h given nodes lie in a component of size at least k; or the probability
that i nodes chosen uniformly at random lie in a component of size at least k. Such problems arise
more frequently in reliability analysis than one might expect; see [17, 20].
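
The special cases named above can be checked numerically on a toy network. The following is a minimal
brute-force sketch of the general measure, assuming independent edge and node failures; the function names
(components, psi, con_edges, con) are hypothetical, and the convention that the measure is 0 when fewer
than i non-target nodes are operational is an assumption made here, not something stated in [17, 20].
It enumerates every edge and node state and is intended only for networks with a few elements; it is not
one of the efficient algorithms of [19].

```python
from itertools import combinations, product
from math import comb


def components(nodes, edges):
    """Connected components of the graph (nodes, edges), as a list of sets."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for w in adj[u] - comp:
                comp.add(w)
                stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps


def psi(nodes, edges, L, j):
    """1 if some component contains all of L and at least j other vertices, else 0."""
    L = set(L)
    return int(any(L <= c and len(c) - len(L) >= j for c in components(nodes, edges)))


def con_edges(nodes, edge_prob, H, i, j):
    """The measure when all nodes operate: average over I, expectation over edge states."""
    nodes, H = list(nodes), set(H)
    rest = [v for v in nodes if v not in H]
    if len(rest) < i:
        return 0.0  # assumption: no admissible choice of I contributes nothing
    edges = list(edge_prob)
    total = 0.0
    for state in product([True, False], repeat=len(edges)):
        weight, up = 1.0, []
        for e, works in zip(edges, state):
            weight *= edge_prob[e] if works else 1.0 - edge_prob[e]
            if works:
                up.append(e)
        total += weight * sum(psi(nodes, up, H | set(I), j) for I in combinations(rest, i))
    return total / comb(len(rest), i)


def con(nodes, edge_prob, node_prob, H, i, j):
    """The measure with node failures: sum over operational node sets X containing H."""
    nodes, H = list(nodes), set(H)
    rest = [v for v in nodes if v not in H]
    total = 0.0
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            X = H | set(extra)
            weight = 1.0
            for v in nodes:
                weight *= node_prob[v] if v in X else 1.0 - node_prob[v]
            induced = {e: p for e, p in edge_prob.items() if e[0] in X and e[1] in X}
            total += weight * con_edges(X, induced, H, i, j)
    return total


# Illustrative data only: a triangle with perfectly reliable nodes.
V = ["a", "b", "c"]
p = {("a", "b"): 0.5, ("b", "c"): 0.5, ("a", "c"): 0.5}
rho = {v: 1.0 for v in V}
print(con(V, p, rho, {"a", "b"}, 0, 0))  # 2-terminal reliability of a and b
print(con(V, p, rho, set(), 2, 0))       # 2-resilience
print(con(V, p, rho, set(), 0, 2))       # kSet with k = 2
```

On this triangle the first two calls agree at 0.625, and the kSet value is 0.875, the probability that at
least one edge operates.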

6.  Develop  clustering  techniques  to  identify  core  communities.  A  report  will  be  delivered  that  describes 
(1) the use of reliability computations in conjunction with clustering; and (2) the use of density-based
approaches for the heuristic identification of clusters. In order to isolate communities at the periphery
of the network, a necessary step is the determination of the “center” or “core” of the network.
Clustering techniques based on density and on k-nearest-neighbour approaches will be compared to
determine central clusters that are expected to form the physical backbone of the network.

The multiterminal resilience techniques developed in [19, 20] (described under Tasks 3 and 4 above)
underlie a ranking of nodes. The easiest application is to determine the ‘influence’ of a node using its
ability to communicate with each other node. We examined more sophisticated applications in [24].
Most relevant is the ranking of nodes by their ability to connect communities of specified sizes, which
determines an importance for each node. Importance alone does not form a core cluster. These
measures also provide a means for determining the importance of a node as a function of the reliability
of each other node, that is, for determining the contribution of each node to the importance of each other.
Computational tools have been developed to make these calculations for small networks.
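
The simplest ranking mentioned above can be sketched as follows. This is an illustration under stated
assumptions, not the tool developed in the project: ‘influence’ is taken here to be the mean two-terminal
reliability from a node to every other node, nodes are assumed to be perfectly reliable, and the names
find and influence_ranking are hypothetical. Each edge state is enumerated once, and a union-find
structure gives the size of the component containing each node.

```python
from itertools import product


def find(parent, v):
    """Union-find root lookup with path halving."""
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v


def influence_ranking(nodes, edge_prob):
    """Rank nodes by mean two-terminal reliability to all other nodes (exhaustive)."""
    nodes = list(nodes)
    edges = list(edge_prob)
    score = {v: 0.0 for v in nodes}
    for state in product([True, False], repeat=len(edges)):
        weight = 1.0
        parent = {v: v for v in nodes}
        for (u, w), works in zip(edges, state):
            weight *= edge_prob[(u, w)] if works else 1.0 - edge_prob[(u, w)]
            if works:
                parent[find(parent, u)] = find(parent, w)
        size = {}
        for v in nodes:
            root = find(parent, v)
            size[root] = size.get(root, 0) + 1
        for v in nodes:
            # Expected number of other nodes reachable from v in this state.
            score[v] += weight * (size[find(parent, v)] - 1)
    n = len(nodes)
    ranked = [(v, score[v] / (n - 1)) for v in nodes]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Illustrative data only: a path a - b - c with edge reliability 0.9.
for node, value in influence_ranking(["a", "b", "c"], {("a", "b"): 0.9, ("b", "c"): 0.9}):
    print(node, round(value, 3))
```

On the path, the middle node b ranks first with value 0.9, while each end node has value
(0.9 + 0.81) / 2 = 0.855.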

5.  Assess  the  accuracy  versus  efficiency  of  reliability  computations  for  transitive  closure  of  relationships 
whose  strengths  are  represented  by  probabilities.  A  report  that  describes  the  modelling  of  social 
networks  using  metrics  based  on  network  reliability  will  be  delivered.  This  will  investigate  whether 
relationships  are  adequately  captured  by  considering  strengths  of  connections  between  pairs  of  nodes, 
or  whether  interactions  among  many  entities  are  needed  to  capture  social  behaviour.  It  will  also 
examine  efficient  bounding  techniques  for  the  relevant  reliability  measures  to  determine  the  feasibility 
of  obtaining  sufficiently  accurate  estimates  of  the  strength  of  relationship  for  large  networks. 

Our work has focussed on citation networks. Under our direction, an Honors undergraduate student,
David Weber, has collected data and formed large networks based on direct citation and on indirect
cocitation and bibliographic coupling data [24]. The analysis of these networks indicates that efficient
bounding techniques based on edge-decompositions of the networks (in particular, the edge-disjoint
pathset and cutset bounds and the consecutive pathset and cutset bounds referenced in [13]) are
sufficiently accurate to distinguish among links and nodes. The inherent asymmetry of the networks
involved appears to underlie the success of these methods. Limited application of the factoring method
(see [13]) suffices in all cases examined to distinguish among groups of nodes with similar
reliabilities. It remains desirable to adapt these bounds to the novel resilience measures discussed under Task
4. (This was begun by a Master’s student, Kumaraguru Paramavisam, who left the program without
completing the work.) In the interim, we have employed a crude Monte Carlo method.
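
For illustration, a crude Monte Carlo estimator of k-terminal reliability can be sketched as below. This
is a generic sampling scheme under the assumption of independent edge failures and perfectly reliable
nodes; it is not claimed to be the implementation used in the project, and the function name and its
parameters (samples, seed) are hypothetical.

```python
import random


def monte_carlo_reliability(nodes, edge_prob, targets, samples=20000, seed=0):
    """Estimate the probability that all targets are connected by sampling edge states."""
    rng = random.Random(seed)
    targets = list(targets)
    hits = 0
    for _ in range(samples):
        # Sample one network state: each edge operates independently.
        adj = {v: [] for v in nodes}
        for (u, w), p in edge_prob.items():
            if rng.random() < p:
                adj[u].append(w)
                adj[w].append(u)
        # Depth-first search from the first target.
        seen, stack = {targets[0]}, [targets[0]]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        hits += all(t in seen for t in targets)
    return hits / samples


# Illustrative data only: the exact two-terminal value for this triangle is 0.625.
triangle = {("a", "b"): 0.5, ("b", "c"): 0.5, ("a", "c"): 0.5}
print(monte_carlo_reliability(["a", "b", "c"], triangle, ["a", "b"]))
```

The standard error of such an estimate shrinks only as the inverse square root of the number of samples,
which is why the method is described as crude: distinguishing nodes with nearly equal reliabilities
requires many samples.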

Modelling social networks based on their use of various communications networks relies on the
development of appropriate link and node probabilities, which we discuss further under Tasks 1 and 2,
below.

1. Identify community usage patterns. A report identifying possible usage patterns that are characteristic
of  a  community  (i.e.,  common  to  the  community  but  not  frequently  employed  by  other  nodes)  will 
be  delivered.  This  includes  the  types  of  communications  generated,  the  distributions  of  volumes  and 
times  of  such  communications,  and  the  interaction  with  network  services  whose  function  is  known. 

and  2.  Quantify  the  strength  of  the  usage  patterns.  A  report  will  be  delivered  that  describes  the  relative 
merits of quantitative representations (particularly those based on sociometric and scientometric
measures) for capturing inferred social relationships. The relationships derived will measure both social
relationships  and  relationships  reflecting  network  connection,  which  must  be  differentiated. 

Our work has again focussed on citation networks, which represent a very simplified type of
communication underlying a social network. These networks are severely limited by the types of
communications measured, and hence internet data has also been investigated. CAIDA [15] provides
a repository of large-scale measurement data on various Internet functions; in addition, many public
domain tools are available for analysis of these data [16]. The key concern is that, as a result, the
analysis of social structure is ‘data-rich’ but ‘information-poor’. The identification of
possible usage patterns that are characteristic of a community does not appear to require data beyond
that already collected through link, path, and network monitoring. Despite this, the differentiation
between network and social communication, and among various types of social communication, appears
to require a better initial characterization of link types and strengths in the communications network
itself. Patterns of communication involve the types of traffic involved, the types of nodes involved, and
the sequence and the timing of communication. Our results indicate that treating each of these
characteristics independently is not sufficiently discriminating to separate physical communication from
the logical and social communication that it supports. On a positive note, the indirect methods using
cocitation analysis and bibliographic coupling do infer connections between actors that participate in
similar types of communication; however, classifying the types of communication to determine this
similarity is problematic.
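
The two indirect measures can be made concrete with a short sketch. It uses the standard definitions: the
cocitation strength of two papers is the number of papers that cite both, and the bibliographic coupling
strength of two papers is the number of references they share. The function name and the toy data are
hypothetical and are not drawn from the networks collected in [24].

```python
from collections import Counter
from itertools import combinations


def cocitation_and_coupling(citations):
    """citations maps each citing paper to the set of papers it cites."""
    cocitation = Counter()  # pair of cited papers -> number of papers citing both
    for refs in citations.values():
        for a, b in combinations(sorted(refs), 2):
            cocitation[(a, b)] += 1
    coupling = Counter()    # pair of citing papers -> number of shared references
    for p, q in combinations(sorted(citations), 2):
        shared = len(set(citations[p]) & set(citations[q]))
        if shared:
            coupling[(p, q)] = shared
    return cocitation, coupling


# Illustrative data only.
citations = {"P1": {"A", "B"}, "P2": {"A", "B", "C"}, "P3": {"C"}}
cocitation, coupling = cocitation_and_coupling(citations)
print(dict(cocitation))
print(dict(coupling))
```

In this toy example, A and B are cocited twice, and P1 and P2 are coupled through two shared references.
Neither count says anything about why a citation was made, which is the classification difficulty noted
above.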

Even in the citation data analyzed, it is understood that different actual citations serve different social
purposes. Some are accepted as legitimate, such as those to well-known prior art, to key background
(including relevant self-citation), or to related work cited to draw contrasts with the intended
contribution. Others serve a different purpose unrelated to the contribution, for example citations to well-known
figures in the field in an (often misguided) attempt to suggest the importance of the work, gratuitous
self-citation, and attempts to manipulate the current ranking methods for impact. Our efforts to
discriminate among these types of social communication have been severely constrained in two ways.
First, the specific factors impacting the type and importance of a specific citation are unclear, and the
number of factors to be considered is large. Secondly, the factors do not act in isolation from one
another: interactions among the factors are crucial. As a simple example from citation networks, the
impacts of the author, the author’s institution, and the paper cited are correlated with the type of citation.
But the impact of the journal, both at the time that the cited paper was published and at the time that
the citing paper was published, also correlates with the type of citation; indeed, the change in both
rank and impact of the journal affects the type of the citation. In communications networks, we expect
these problems to be more severe. Many more factors are present, and many more interactions should
be anticipated.

An  effective  classification  of  links  by  type  of  communication  is  a  prerequisite  to  the  identification  of 
communities  with  sufficient  precision.  Our  research  has  therefore  focussed  on  the  determination  and 
measurement  of  interactions  among  the  many  factors  that  are  measured  for  physical  links.  Standard 
design-of-experiments  techniques  are  inadequate  in  this  context  for  a  number  of  reasons.  One  is  that 
many  of  the  factors  in  network  operation  are  measurable  but  not  controllable.  More  importantly, 
before  interactions  can  be  measured,  screening  is  needed  to  find  the  factors  and  interactions  that  may 
be  relevant. 

This problem arises in numerous different settings: screening using D-optimal designs [21], the
location of interaction faults [12], approximate measurement using small sample spaces [1], internet
tomography [4], and compressive sensing ([2], for example). These apparently different research
areas all concern the identification and measurement of interactions among factors in which a signal or sample has
a ‘sparse’ representation. Our research to date has concerned the unification, to the extent possible,
of these different lines of investigation. We have established that the similarities among these topics
are not just cosmetic; rather, they arise from a deep connection in the underlying mathematics.
Although this may appear to be a detour on the road to effective link characterization, we believe that
understanding the fundamental similarities and differences among these is a necessary next step in
finding a sufficiently accurate methodology to classify links, and in turn to use that link classification
to isolate communities. In addition to providing the foundation for the specific problem in identifying
social communities, this line of investigation can pay dividends in interaction fault location, signal
processing, and sampling in sparse spaces generally.

In this line, we have completed a number of papers developing the foundations for the applications to
factor location and screening. In [7, 8], powerful direct constructions using number theoretic
methods have been developed. In [9], new recursive methods are developed. In [22, 23], a powerful
computational search technique is developed. In [6, 14], a substantial generalization of ‘perfect hash
families’ is introduced; these underlie worthwhile improvements in a column replacement strategy for
combinatorial arrays. In [10, 11], an algebraic construction of such hash families is developed. These
efforts are now being connected with compressive sensing, through the work of Colbourn’s new PhD
student, Chris McLean; and with error location through the work of Syrotiuk’s new PhD student,
Abraham Aldaco.

In  the  references  to  follow,  research  that  is  published  or  submitted  and  was  funded  partially  or  wholly 
under  this  project  is  marked  with  an  asterisk  (*). 